
Add ontology support to dataset features endpoint #262

Merged
PGijsbers merged 6 commits into openml:main from hxrshxz:fix/237-ontology-support
Mar 16, 2026
Conversation


@hxrshxz hxrshxz commented Feb 28, 2026

Closes #237

Fetches ontology URIs from data_feature_description and includes them in GET /datasets/features/{dataset_id} responses, achieving feature parity with the PHP API.
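For illustration, the extended endpoint's response might carry the new field roughly as sketched below. The feature names, field layout, and URIs here are assumptions for the sketch, not taken from the actual API schema:

```python
# Hypothetical feature entries as they might appear in the JSON response of
# GET /datasets/features/{dataset_id}; names and URIs are illustrative only.
response = [
    {
        "index": 0,
        "name": "sepallength",
        "data_type": "numeric",
        "ontology": ["http://example.org/ontology/length"],  # newly surfaced field
    },
    {
        "index": 1,
        "name": "class",
        "data_type": "nominal",
        "ontology": None,  # features without ontology entries carry no URIs
    },
]

# Consumers can filter on the new field directly.
annotated = [feature["name"] for feature in response if feature["ontology"]]
```
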


coderabbitai bot commented Feb 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2ba2d4e5-f926-4b5c-8d5f-45a59eb13815

📥 Commits

Reviewing files that changed from the base of the PR and between ab592f7 and 0d79669.

📒 Files selected for processing (2)
  • tests/routers/openml/datasets_test.py
  • tests/routers/openml/migration/datasets_migration_test.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/routers/openml/datasets_test.py
  • tests/routers/openml/migration/datasets_migration_test.py

Walkthrough

This PR adds test coverage for ontology support in dataset features. A new test test_dataset_features_with_ontology validates that ontology data is correctly retrieved and assigned to dataset features, checking specific ontology URLs for certain features while ensuring features without ontology data do not include the field. The datasets migration test is updated to normalize ontology fields in-place rather than removing them entirely, mirroring the approach used for nominal values.

Possibly related PRs

  • Adds ontology support  #255: Implements the backend ontology support (database retrieval, schema field, and assignment to feature.ontology) that these test changes validate against.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage (⚠️ Warning): Docstring coverage is 12.50%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)

Title check (✅ Passed): The title clearly and concisely describes the main change: adding ontology support to the dataset features endpoint.
Description check (✅ Passed): The description directly relates to the changeset by referencing issue #237 and explaining the feature implementation for ontology support.
Linked Issues check (✅ Passed): The changes implement ontology support in the dataset features endpoint by fetching from data_feature_description and normalizing values, fulfilling the requirements in issue #237.
Out of Scope Changes check (✅ Passed): All changes are directly scoped to adding ontology support to dataset features, with no unrelated modifications detected.



@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • In get_feature_ontologies, consider using rows.mappings() and accessing columns via row["index"] / row["value"] (or row._mapping[...]) instead of row.index and row.value to avoid potential conflicts with Row.index and make the column access more explicit and robust.

### Comment 1
<location path="src/database/datasets.py" line_range="134-143" />
<code_context>
     return [Feature(**row, nominal_values=None) for row in rows.mappings()]


+def get_feature_ontologies(dataset_id: int, connection: Connection) -> dict[int, list[str]]:
+    """Return a mapping from feature index to its list of ontology URIs."""
+    rows = connection.execute(
+        text(
+            """
+            SELECT `index`, `value`
+            FROM data_feature_description
+            WHERE `did` = :dataset_id AND `description_type` = 'ontology'
+            """,
+        ),
+        parameters={"dataset_id": dataset_id},
+    )
+    ontologies: dict[int, list[str]] = {}
+    for row in rows:
+        ontologies.setdefault(row.index, []).append(row.value)
+    return ontologies
</code_context>
<issue_to_address>
**issue (bug_risk):** Accessing `row.index`/`row.value` directly on SQLAlchemy `Row` can be fragile; prefer `.mappings()` or `_mapping[...]` for robustness.

In this loop you’re iterating raw `Row` objects and relying on `row.index` / `row.value`, which can collide with `Row.index()` and other internals, and may change across SQLAlchemy versions. Use mapping access instead, e.g. iterate `rows.mappings()` and use `row["index"]` / `row["value"]`, or use `row._mapping["index"]` / `row._mapping["value"]` on the existing result to avoid these name clashes.
</issue_to_address>
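The mapping-style access Sourcery recommends can be illustrated with the standard library's `sqlite3` as a stand-in for SQLAlchemy's `rows.mappings()`. The table and column names mirror the snippet above, but the connection setup and the inserted data are made up for the sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows behave like mappings, akin to rows.mappings()
conn.execute(
    "CREATE TABLE data_feature_description"
    " (did INTEGER, `index` INTEGER, value TEXT, description_type TEXT)"
)
conn.executemany(
    "INSERT INTO data_feature_description VALUES (?, ?, ?, ?)",
    [
        (1, 2, "http://example.org/ontology/a", "ontology"),
        (1, 2, "http://example.org/ontology/b", "ontology"),
        (1, 5, "http://example.org/ontology/c", "ontology"),
        (1, 5, "free-text description", "description"),  # filtered out by the WHERE clause
    ],
)

rows = conn.execute(
    "SELECT `index`, value FROM data_feature_description"
    " WHERE did = ? AND description_type = 'ontology'"
    " ORDER BY `index`, value",  # deterministic ordering
    (1,),
)
ontologies: dict[int, list[str]] = {}
for row in rows:
    # row["index"] cannot collide with a method name, unlike row.index
    ontologies.setdefault(row["index"], []).append(row["value"])
```

The same subscript access works on SQLAlchemy mapping rows, which is what makes it robust against `Row.index` shadowing.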

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/routers/openml/datasets.py (1)

296-298: Optional: defer ontology query until after empty-feature guard.

Line 296 currently performs an extra read even when no features exist and the endpoint will raise immediately.

⚡ Small query-order optimization
-    ontologies = database.datasets.get_feature_ontologies(dataset_id, expdb)
-    for feature in features:
-        feature.ontology = ontologies.get(feature.index)
-
     if not features:
         processing_state = database.datasets.get_latest_processing_update(dataset_id, expdb)
         ...
         raise HTTPException(...)

+    ontologies = database.datasets.get_feature_ontologies(dataset_id, expdb)
+    for feature in features:
+        feature.ontology = ontologies.get(feature.index)
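The reordering can be sketched in isolation. Everything below (function names, the stand-in exception, the injected fetcher) is hypothetical and only mirrors the structure of the suggestion:

```python
def features_with_ontologies(features: list[dict], fetch_ontologies) -> list[dict]:
    """Attach ontology URIs to features, guarding before the extra read."""
    if not features:
        # Stand-in for the endpoint's HTTPException path.
        raise LookupError("dataset has no features")
    ontologies = fetch_ontologies()  # only reached when there is work to do
    for feature in features:
        feature["ontology"] = ontologies.get(feature["index"])
    return features


calls = []

def fake_fetch():
    calls.append(1)
    return {0: ["http://example.org/onto/x"]}

result = features_with_ontologies([{"index": 0}, {"index": 1}], fake_fetch)
```

With the guard first, the empty-features path raises without ever invoking the fetcher, which is the extra read the review comment flags.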
tests/routers/openml/migration/datasets_migration_test.py (1)

225-228: Consider deduplicating legacy list-to-scalar normalization.

Line 224 and Line 228 apply the same conversion pattern; extracting a tiny helper would reduce drift in future parity tweaks.

♻️ Minimal refactor
+    def _legacy_scalar_or_list(values: list[str]) -> str | list[str]:
+        return values if len(values) > 1 else values[0]
+
     for feature in python_body:
         for key, value in list(feature.items()):
             if key == "nominal_values":
                 values = feature.pop(key)
-                feature["nominal_value"] = values if len(values) > 1 else values[0]
+                feature["nominal_value"] = _legacy_scalar_or_list(values)
             elif key == "ontology":
                 values = feature.pop(key)
-                feature["ontology"] = values if len(values) > 1 else values[0]
+                feature["ontology"] = _legacy_scalar_or_list(values)
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/database/datasets.py`:
- Around line 139-142: The SQL selecting ontology entries from
data_feature_description (WHERE `did` = :dataset_id AND `description_type` =
'ontology') lacks an ORDER BY, causing non-deterministic array ordering; update
that query (and the similar occurrence that also selects ontology rows) to
append "ORDER BY `index` ASC" (or ORDER BY `index`) so results are consistently
ordered by the `index` column—apply this change to both places that query
data_feature_description for description_type='ontology'.
- Around line 147-148: The loop uses positional attribute access on SQLAlchemy
Row (for row in rows: ontologies.setdefault(row.index, []).append(row.value)),
but row.index resolves to the Row.index() method in SQLAlchemy 2.x; change to
mapping-style access (e.g., row['index'] and row['value'] or
row._mapping['index'] / row._mapping['value']) so the dictionary key is the
actual column value; update the code that iterates over rows and uses
ontologies.setdefault to use these mapping keys instead.

---

Nitpick comments:
In `@src/routers/openml/datasets.py`:
- Around line 296-298: The code calls
database.datasets.get_feature_ontologies(dataset_id, expdb) before checking if
there are any features, causing an unnecessary DB read; change the order so you
first check the features collection (e.g., if not features: return or raise as
the endpoint expects) and only call get_feature_ontologies(dataset_id, expdb)
when features is non-empty, then iterate and set feature.ontology =
ontologies.get(feature.index); this uses the existing symbols features,
dataset_id, expdb, and database.datasets.get_feature_ontologies.

In `@tests/routers/openml/migration/datasets_migration_test.py`:
- Around line 225-228: Extract the repeated "list-to-scalar" normalization into
a small helper (e.g., normalize_legacy_list(value) or normalize_list_to_scalar)
and replace both in-place patterns that operate on feature["ontology"] (the
block that does values = feature.pop(key); feature["ontology"] = values if
len(values) > 1 else values[0]) with a call to that helper so the code becomes
feature["ontology"] = normalize_legacy_list(feature.pop(key)); ensure the helper
accepts the popped value (either list or str) and returns the list when length >
1 or the single element otherwise.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 72989df and 5280c6a.

📒 Files selected for processing (5)
  • src/database/datasets.py
  • src/routers/openml/datasets.py
  • src/schemas/datasets/openml.py
  • tests/routers/openml/datasets_test.py
  • tests/routers/openml/migration/datasets_migration_test.py

@hxrshxz hxrshxz force-pushed the fix/237-ontology-support branch from ab592f7 to d132eb2 on February 28, 2026 17:58
@PGijsbers PGijsbers mentioned this pull request Mar 16, 2026
Copilot AI review requested due to automatic review settings on March 16, 2026 13:42
Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 25.00000% with 3 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@c617d6d). Learn more about missing BASE report.

Files with missing lines: src/routers/openml/datasets.py (patch coverage 0.00%, 3 lines missing ⚠️)
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #262   +/-   ##
=======================================
  Coverage        ?   54.44%           
=======================================
  Files           ?       34           
  Lines           ?     1462           
  Branches        ?      119           
=======================================
  Hits            ?      796           
  Misses          ?      664           
  Partials        ?        2           


@PGijsbers PGijsbers marked this pull request as draft March 16, 2026 13:49
@PGijsbers PGijsbers marked this pull request as ready for review March 16, 2026 13:49
@sourcery-ai sourcery-ai bot left a comment

Hey - I've left some high level feedback:

  • The new ontology test relies on magic dataset IDs and feature indices (e.g. dataset 11, features 1–3); consider using fixtures or named constants to make the intent clearer and reduce coupling to specific seeded data.
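One way to follow this suggestion is to hoist the magic values into named module-level constants that the test reads from. All values below are placeholders, not the real seeded dataset IDs or URIs:

```python
# Placeholder constants; the real test would name the actual seeded dataset
# and its annotated features instead of these invented values.
ONTOLOGY_DATASET_ID = 11
EXPECTED_ONTOLOGIES = {
    1: ["http://example.org/onto/first"],
    2: ["http://example.org/onto/second"],
}
INDICES_WITHOUT_ONTOLOGY = {0, 3}


def assert_ontology_expectations(features: list[dict]) -> None:
    """Check each feature against the named expectations above."""
    for feature in features:
        index = feature["index"]
        if index in EXPECTED_ONTOLOGIES:
            assert feature["ontology"] == EXPECTED_ONTOLOGIES[index]
        elif index in INDICES_WITHOUT_ONTOLOGY:
            assert feature.get("ontology") is None


# Exercise the checker on a fabricated payload.
fake_features = [
    {"index": 0},
    {"index": 1, "ontology": ["http://example.org/onto/first"]},
    {"index": 2, "ontology": ["http://example.org/onto/second"]},
    {"index": 3},
]
assert_ontology_expectations(fake_features)
```

Named constants make the coupling to seeded data explicit in one place, so a reseed only requires updating the constants rather than hunting through assertions.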


@PGijsbers

Trying to trigger CI on the latest commit.

@PGijsbers PGijsbers closed this Mar 16, 2026
@PGijsbers PGijsbers reopened this Mar 16, 2026
@PGijsbers PGijsbers merged commit c9ada15 into openml:main Mar 16, 2026
3 checks passed

Labels: None yet
Projects: None yet
Development: Successfully merging this pull request may close these issues: Add ontology support
3 participants